Abstract - The automatic disambiguation of word senses (i.e., the identification of which of
the meanings is used in a given context for a word that has multiple meanings) is essential for
such applications as machine translation and information retrieval, and represents a key step for
developing the so-called Semantic Web. Humans disambiguate words in a straightforward fashion,
but this does not apply to computers. In this paper we address the problem of Word Sense
Disambiguation (WSD) by treating texts as complex networks, and show that word senses can
be distinguished upon characterizing the local structure around ambiguous words. Our goal was
not to obtain the best possible disambiguation system, but we nevertheless found that in half of
the cases our approach outperforms traditional shallow methods. We show that the hierarchical
connectivity and clustering of words are usually the most relevant features for WSD. The results
reported here shed light on the relationship between semantic and structural parameters of
complex networks. They also indicate that when combined with traditional techniques the complex
network approach may be useful to enhance the discrimination of senses in large texts.
Introduction. — Many statistical methods are now used to investigate language [1] in
attempts to understand empirical findings such as Zipf's law [2], and to model syntactic
and semantic relationships between words or passages [3]-[13]. The numerous studies
encompass complex networks [14] (CN) representing texts in several applications, including
summarization [5], assessment of quality of machine translators [6], authorship
recognition [7], keyword extraction [8], topic identification [15] and segmentation.
In this paper we assess the use of complex network concepts for Word Sense Disambiguation
(WSD), which is a crucial task for the Semantic Web [17] and for machine translation [18].
We analyze 10 ambiguous words with 16 topological measurements and show a strong
relationship between senses and local features of complex networks. Indeed, for some of
the ambiguous words the distinguishability with the CN approach is better than that
obtained with the traditional analysis of neighbors. From an analysis of feature
relevance, we found that the strength of connection of neighbors in higher hierarchies
and the hierarchical clustering coefficient are the most efficient metrics to
discriminate word senses.

Typical Approaches to Word Sense Disambiguation. — The WSD problem has been widely
studied by computer scientists and researchers interested in Natural Language
Processing [?] tasks. Even though humans can readily discriminate specific senses of a
word, this is not the case for a computer. In fact, WSD is considered one of the most
complex problems in Artificial Intelligence [20]. The two conventional approaches to WSD
are: i) the deep paradigm, based on a large amount of linguistic knowledge (e.g.
dictionaries, thesauri or semantic networks); and ii) the shallow paradigm, which makes
use of statistical techniques. The deep paradigm is in theory the best strategy as it
mimics human thinking, but in practice methods requiring knowledge bases do not achieve
the best performance because there is still no database that can cover human knowledge.
Moreover, this paradigm is often impracticable because the manual creation of knowledge
bases is an expensive, time-consuming endeavor. In contrast, simpler methods such as
those based on the analysis of contexts surrounding ambiguous words [21] have led to
better performance. One of the most popular algorithms, referred to as Lesk [22], assumes
that words in a given neighborhood tend to share a common topic, an assumption that is
used by other algorithms [19]. Actually, the analysis of contexts based on the recurrence
of nearby components is so efficient that it has even been employed to decipher encoded
manuscripts [23].



Networks have also been applied to the WSD task, and some of the network-based
algorithms are now close to the state of the art in disambiguation. One of the earliest
works dates back to 1968 and uses the network structure to store knowledge in the form
of a semantic memory [24]. Other examples include the application of random walks [25]
in semantic networks whose nodes are linked according to semantic relations provided by
WordNet [26]. With a different approach, the HyperLex algorithm [27] connects words that
co-occur in a given paragraph and uses the weight of edges (given by the relative
frequency of occurrence of the corresponding connected nodes) to disambiguate words.
Although these algorithms use the network representation in processing steps, they
differ substantially from our strategy because they all consider the label of nodes
while we focus on the characterization of local structural properties.

Methodology. 

Database. In the experiments, we used a set of 18 books to retrieve 10 ambiguous words
(save, note, march, present, jam, ring, just, bear, rock and close), which were manually
disambiguated. The only criterion in choosing these words was that the meanings of each
word are quite distinct, which minimizes possible inaccuracies in the manual
disambiguation. The lists of word senses and books are given in Tables S1 and S2,
respectively, of the Supplementary Information (SI). The text in the books was
represented as networks, as explained below.

Modeling Texts as Complex Networks. The model used to represent text is known in the
literature as co-occurrence or adjacency networks [6, 7]. Basically, words are
represented as nodes, which are directionally linked according to the natural reading
order. In other words, if a word i appears immediately before word j in the text, then
there will be the i → j edge in the network. When a given association i → j is repeated
in the text, the weight of the corresponding edge is incremented. Before creating nodes
and edges, stopwords (prepositions, articles and other high-frequency words with little
semantic meaning) are removed (the full list of disregarded stopwords is shown in the
SI). In addition, the remaining words are converted to their canonical form in order to
group words with different inflections referring to the same concept.



Mathematically, the text network is defined by the matrix $W = \{w_{ij}\}$, whose
element $w_{ij}$ counts the number of times the word i appeared immediately before the
word j. When defining some of the complex network measurements we also employed the
non-oriented, non-weighted version, represented by the matrix $A = \{a_{ij}\}$, so that
$a_{ij} = 1$ if i appeared at least once as a neighbor of j (regardless of the position)
and $a_{ij} = 0$ otherwise. When a word appears repeatedly in the same text, all of its
occurrences are mapped to the same node in the corresponding network. This procedure is
not adopted, however, for the ambiguous words under analysis: each occurrence is taken
as a distinct node in the network, so that it is possible to characterize each
occurrence of an ambiguous word and correlate its structural features with its meaning.
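
The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the experiments: the function name `build_network`, the tiny `STOPWORDS` set and the `word#n` labelling of occurrences are assumptions made for the example, and lemmatization is assumed to have been applied to the tokens beforehand.

```python
from collections import defaultdict

# Illustrative subset only; the actual stopword list is given in the SI.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in"}

def build_network(tokens, ambiguous, stopwords=STOPWORDS):
    """Return (W, A) for a token sequence.

    W maps a directed pair (i, j) to the number of times i appeared
    immediately before j; A maps each node to its set of neighbors,
    ignoring direction and weight.
    """
    nodes = []
    occurrence = 0
    for tok in tokens:
        tok = tok.lower()
        if tok in stopwords:
            continue
        if tok == ambiguous:
            # Each occurrence of the ambiguous word becomes its own node.
            occurrence += 1
            tok = f"{ambiguous}#{occurrence}"
        nodes.append(tok)

    W = defaultdict(int)
    A = defaultdict(set)
    for i, j in zip(nodes, nodes[1:]):
        W[(i, j)] += 1
        if i != j:
            A[i].add(j)
            A[j].add(i)
    return dict(W), dict(A)
```

For instance, `build_network("the bear ate honey and the bear ate fish".split(), "bear")` yields the separate nodes `bear#1` and `bear#2`, each linked only to its own immediate context.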

Characterization of Senses Through Complex Networks Features. To characterize the local
structure of an ambiguous word, we used a set of 16 complex network local measurements.
The simplest measurement is the degree $k_i$, i.e., the number of connections (without
considering the weight of the edges). In terms of the adjacency matrix A, the degree is
computed as

$$k_i = \sum_j a_{ij}. \qquad (1)$$



The weighted version of $k_i$, which considers the strength [14] of the links, is given by

$$s_i = \sum_j w_{ij}. \qquad (2)$$



Extensions of these two measurements were considered through the analysis of further
hierarchies [28], for the hierarchical expansion usually yields better network
characterization [6, 28]. The expansion of a given node is made by merging the node
under analysis with its neighbors in a single node, keeping the external connections of
the neighbors [6, 28]. This procedure is then repeated to generate deeper expansions.
This hierarchical characterization was adopted for both $k_i$ and $s_i$, where the m-th
expansion is represented as $k_{m+1}$ and $s_{m+1}$. We have not made explicit use of
$k_{m+1}$ and $s_{m+1}$ when $m = 0$ because these measurements take constant values as
a consequence of considering each occurrence of an ambiguous word as a single node.
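
One plausible reading of the merging procedure can be sketched as follows. The function name, the data layout (`A` as a dictionary of neighbor sets, `S` as a dictionary of undirected edge weights keyed by `frozenset`) and the choice to count every edge leaving the merged set are assumptions made for illustration; the exact definitions are those of Refs. [6, 28].

```python
def hierarchical_degree_strength(A, S, node, expansions=2):
    """Return [(k_1, s_1), (k_2, s_2), ...] for `node`.

    A: dict mapping each node to the set of its neighbors (undirected).
    S: dict mapping frozenset({u, v}) to the weight of the edge u-v;
       missing entries are treated as weight 1.
    The m-th expansion merges the current set with all of its neighbors
    and then measures the connections leaving the merged set.
    """
    merged = {node}
    results = []
    for _ in range(expansions + 1):
        frontier = set()
        strength = 0
        for u in merged:
            for v in A.get(u, ()):
                if v not in merged:
                    frontier.add(v)
                    strength += S.get(frozenset((u, v)), 1)
        # Degree = distinct external nodes; strength = total external weight.
        results.append((len(frontier), strength))
        merged |= frontier
    return results
```

Under this reading, the first pair returned is the ordinary degree and strength, which is constant for every occurrence of an ambiguous word, and the later pairs are the hierarchical versions.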

In addition to the local measurements, we quantified the connectivity of nodes to their
neighbors. Analogously to the adoption of further hierarchies, the study of topological
properties of neighbors also yields better network characterization [29]. Indeed,
neighbors have played a key role in many algorithms, such as the PageRank algorithm and
its variations. In this paper, the following neighborhood-based measurements were
employed: the average degree and strength of the neighbors ($\langle k_n \rangle$ and
$\langle s_n \rangle$, respectively) and their standard deviations ($\Delta k_n$ and
$\Delta s_n$). Another structural measurement used was the clustering coefficient ($C$),
which is proportional to the fraction of triangles over the total number of connected
triads. More specifically, the clustering is computed as:



(3) 



It is known that a correlation exists between the number of semantic contexts where a
word appears and its clustering coefficient [?]. Since word senses might be related to
the number of contexts (because distinct senses could appear in different contexts) this
measurement may be useful in the disambiguation task. Similarly to the degree and
strength measurements, we expanded the hierarchies up to m = 3 to compute this
measurement.
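
As a concrete illustration, the widely used local form of the clustering coefficient (the fraction of pairs of neighbors of a node that are themselves connected) can be computed as below. This is a sketch of the standard textbook definition, which may differ in normalization from the expression used in this paper; the function name and the neighbor-set representation are assumptions.

```python
from itertools import combinations

def local_clustering(A, node):
    """Fraction of neighbor pairs of `node` that are linked to each other.

    A: dict mapping each node to the set of its neighbors (undirected).
    """
    neighbors = A.get(node, set())
    k = len(neighbors)
    if k < 2:
        return 0.0  # no pair of neighbors, hence no possible triangle
    linked = sum(1 for u, v in combinations(neighbors, 2) if v in A.get(u, ()))
    return 2.0 * linked / (k * (k - 1))
```

Note that a node with exactly two neighbors can only take the values 0 or 1 under this definition, which is one reason why the hierarchical versions of the measurement are more informative for individual occurrences of ambiguous words.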

The local structure was also examined with shortest paths (or geodesic paths) between
two nodes, which are paths whose sum of the edge weights is minimum. If $d_{ij}$ is the
length of the shortest path between nodes $v_i$ and $v_j$ in the adjacency matrix A,
then the average shortest path length $l$ for $v_i$ is:

$$l_i = \frac{1}{n-1} \sum_{j \neq i} d_{ij}, \qquad (4)$$

where $n$ is the number of nodes in the network.
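
Because the paths are taken on the unweighted matrix A, the average shortest path length of a node can be obtained with a breadth-first search. The sketch below is illustrative only: the function name is hypothetical, and averaging over the nodes reachable from the source (rather than over all nodes) is an assumption made to keep the example well defined on disconnected graphs.

```python
from collections import deque

def average_shortest_path(A, source):
    """Mean hop distance from `source` to every node reachable from it.

    A: dict mapping each node to the set of its neighbors (undirected).
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in A.get(u, ()):
            if v not in dist:  # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    others = [d for node, d in dist.items() if node != source]
    return sum(others) / len(others) if others else 0.0
```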



In networks of text, the shortest path quantifies the centrality of a word according to
its distance to the most frequent words. We chose to use this measurement to verify if
the distance from an ambiguous word to the core-content concepts of the books can be
used to distinguish senses. Shortest paths were also employed to compute the betweenness
of words ($B$). Let $\eta_{st}^{(i)}$ be the number of shortest paths between nodes
$v_s$ and $v_t$ that pass through node $v_i$. If $g_{st}$ is the number of shortest
paths between nodes $v_s$ and $v_t$, then $B$ is defined as:

$$B_i = \sum_{s} \sum_{t} \frac{\eta_{st}^{(i)}}{g_{st}}. \qquad (5)$$
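
A compact way to compute the betweenness for all nodes of an unweighted network is Brandes' accumulation scheme, sketched below. The choice of this algorithm is an assumption for illustration (the text does not state which algorithm was used), and the sketch sums over ordered pairs of endpoints, so each unordered pair of an undirected network is counted twice.

```python
from collections import deque

def betweenness(A):
    """Betweenness of every node; A maps each node to its neighbor set.

    Every node must appear as a key of A.
    """
    B = dict.fromkeys(A, 0.0)
    for s in A:
        order = []                        # nodes in order of discovery
        preds = {v: [] for v in A}        # predecessors on shortest paths
        sigma = dict.fromkeys(A, 0)       # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in A[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(A, 0.0)     # dependency of s on each node
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                B[w] += delta[w]
    return B
```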



Even though we are aware of the correlation between $B$ and $k$ ($B$ scales as a power
of $k$) [7, 31] in large networks, the possible distinct values of $B$ taken for
ambiguous words will not reflect differences in word frequency, because $k = 2$ for each
occurrence of an ambiguous word. Actually, $B$ will reflect the ability of words to
connect different network regions [31, 32] or different contexts [?].

The 16 measurements employed to characterize the local structure of ambiguous words are
summarized in Table 1, which contains the measurements classified according to their
type (connectivity, clustering, neighborhood and paths), their notation and complexity.

Table 1: List of measurements employed to characterize the local structure of nodes
representing ambiguous words in complex networks. The last column indicates the time
complexity, i.e., the time taken to compute each measurement as a function of the number
of nodes n and edges e. For the clustering coefficient, the time complexity ranges
between $O(n)$ and $O(n^2)$ because it depends explicitly both on $\langle k \rangle$
and $\langle k^2 \rangle$. Further details regarding measurements and time complexity
can be found in the SI.

Group          Notation                                  Complexity
Connectivity   $k_2$, $k_3$ and $k_4$                    $O(n + e)$
               $s_2$, $s_3$ and $s_4$                    $O(n + e)$
Clustering     $C_1$, $C_2$, $C_3$ and $C_4$             $[O(n), O(n^2)]$
Neighborhood   $\langle k_n \rangle$ and $\Delta k_n$    $O(n^2)$
               $\langle s_n \rangle$ and $\Delta s_n$    $O(n^2)$
Paths          $l$                                       $O(n^2)$
               $B$                                       $O(n^2)$

To verify if the description provided by the measurements above is useful for the WSD
task, we used machine learning algorithms that induce classifiers from the training set
provided for each word. The quality of the results was then evaluated using the 10-fold
cross-validation technique [34], which was chosen because it is robust in the sense that
the training set is always different from the evaluation set. Thus, it prevents
overfitted inductors from achieving high accuracy rates. Three inductor algorithms were
used: the C4.5 algorithm [35], which generates trees based on the gain provided by each
feature; the Naive Bayes algorithm [35], which uses the Bayes theorem; and the k nearest
neighbor algorithm [35] (kNN), which classifies an external unknown instance according
to the most similar instance of the training database in a normalized space including
all features. Details regarding algorithms and the cross-validation technique are given
in the SI.
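
The evaluation protocol can be illustrated with a minimal nearest-neighbor classifier and a 10-fold cross-validation loop. This is a sketch under simplifying assumptions, not the setup used in the paper: all function names are hypothetical, folds are assigned by index modulo the number of folds rather than at random, a single nearest neighbor is used, and the min-max normalization is computed once over the whole data set for brevity.

```python
def normalize(X):
    """Rescale every feature (column) of X to the interval [0, 1]."""
    columns = list(zip(*X))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(x - lo) / (hi - lo) if hi > lo else 0.0
             for x, lo, hi in zip(row, lows, highs)] for row in X]

def nearest_label(train_X, train_y, x):
    """Label of the training instance closest to x (Euclidean distance)."""
    best = min(range(len(train_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[best]

def cross_val_accuracy(X, y, n_folds=10):
    """Accuracy of the nearest-neighbor rule under n-fold cross-validation."""
    X = normalize(X)
    hits = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        # Test instances never take part in the training set of their fold.
        hits += sum(nearest_label(train_X, train_y, X[i]) == y[i]
                    for i in test_idx)
    return hits / len(X)
```

Here each row of `X` would hold the 16 measurements of one occurrence of an ambiguous word and `y` the manually assigned sense.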



Results and Discussion. — The 10 ambiguous words were characterized with complex network
measurements to verify if senses can be inferred from a topological analysis. Table 2
shows in the second and third columns, respectively, the accuracy rate and the
corresponding p-value $\alpha_{cn}$ relative to a classification performed by assigning
the most common (i.e. the most frequent) sense to the ambiguous word. A significant
accuracy ($\alpha_{cn} < 5.0 \times 10^{-2}$) could be observed in 9 out of the 10
words. An example of a scatter plot depicting the discrimination obtained for the word
"ring" is shown in Figure 1, where each axis represents a linear combination of the 16
measurements provided by the Canonical Variable Analysis technique. These results
confirm the relationship between local characteristics of adjacency networks and word
senses, reinforcing the suitability of complex network methods to relate structure and
semantics. We believe that the ability to distinguish senses is at least partially due
to the fact that co-occurrence networks probably imply syntactic factors that are
reflected on the semantic relations.

This relationship, however, is still difficult to establish because there is no
consolidated interpretation for the measurements of word adjacency networks (see e.g.
Ref. [7]).

Although our primary goal was not a search for the best possible disambiguation system,
we compare our results with the traditional approach based on the analysis of frequency
of nearby words. Classifiers were induced using attributes that represent the frequency
of the 5, 20 or 50 words surrounding the ambiguous word. The lowest p-values and the top
classifiers are shown in Table 3. For the first 5 words the CN approach outperformed the
traditional method. This means that the local structure can be even more relevant than
the frequency analysis of neighbors. Obviously, we are not suggesting that the CN
approach replace the approaches based on semantic information provided by neighbors,
since complex network measurements are statistically reliable only when computed in
large texts. Still, it could be valuable to combine both strategies in disambiguation
systems. As for the methods and algorithms, the differences regarding the best
classifier are worth noting. The CN approach performs better with the kNN algorithm,
while the Naive Bayes algorithm is better in the traditional approach. These differences
probably occur because of the distinct number of attributes in each approach. Finally,
in 6 out of the 10 words considered, the traditional analysis with 5 neighbors
outperformed the classifiers with larger numbers of neighbors.

Table 2: Results from characterizing ambiguous words with structural complex network
measurements. The accuracy rate and the p-value, considering as null model a classifier
based on the most common sense, and the best classifier are shown for all words. The
minimum set of measurements considered in each classifier is shown in the last column.
Apart from the word "close", all the classifications achieved significant accuracy rates
($\alpha_{cn} < 5.0 \times 10^{-2}$). Interestingly, even though 16 measurements were
employed in the characterization, the top classifiers were induced with 5 measurements
or fewer.

Word      Acc. Rate   $\alpha_{cn}$          Best Ind.   Minimum Set of Measurements
save      87.64 %     $1.6 \times 10^{-3}$   kNN         $\langle s_n \rangle$ and $l$
note      84.53 %     $4.7 \times 10^{-2}$   kNN         $C_3$, $s_3$ and $\Delta s_n$
march     86.95 %     $1.9 \times 10^{-3}$   kNN         $C_3$, $\langle k_n \rangle$, …
present   71.14 %     $2.2 \times 10^{-2}$   kNN         $s_4$, $\langle k_n \rangle$, $\langle k_n \rangle$ and $\Delta s_n$
jam       100.0 %     $6.0 \times 10^{-3}$   kNN         $C_4$, $s_4$ and $\Delta k_n$
ring      84.61 %     $2.8 \times 10^{-4}$   kNN         $C_4$, $k_4$, $s_4$, $\Delta k_n$ and $l$
just      51.28 %     $1.6 \times 10^{-2}$   kNN         $C_2$ and $C_3$
bear      61.95 %     $8.0 \times 10^{-3}$   kNN         $C_4$, $k_3$ and $s_4$
rock      79.30 %     $9.7 \times 10^{-9}$   C4.5        $k_2$, $s_2$, $k_3$, $k_4$ and $s_4$
close     72.20 %     $8.0 \times 10^{-2}$   kNN         $C_2$, $s_2$, $\langle s_n \rangle$ and $\Delta s_n$

Fig. 1: Canonical Analysis Projection for the word "ring". The senses considered were
i) ring of a bell and ii) circle of metal. While the use of the sense 'circle of metal'
is more regular, as revealed by the low dispersion, the sense 'sound of a bell' tends to
be more heterogeneous.

Table 3: Results from characterizing ambiguous words with the traditional approach
(second column) and with the CN approach (third column). The p-value ($\alpha_{tr}$ for
the traditional approach and $\alpha_{cn}$ for the CN-based method) was computed
considering as null model a classifier based on the most common sense. The best
classifier algorithm in the fourth column refers to the traditional approach.

Word      $\alpha_{tr}$             $\alpha_{cn}$          Best Ind.
save      $6.2 \times 10^{-1}$      $1.6 \times 10^{-3}$   kNN
note      $3.8 \times 10^{-1}$      $4.7 \times 10^{-2}$   kNN
march     $1.4 \times 10^{-2}$      $1.9 \times 10^{-3}$   kNN
present   $6.8 \times 10^{-2}$      $2.2 \times 10^{-2}$   Naive Bayes
jam       $1.0 \times 10^{-2}$      $6.0 \times 10^{-3}$   Naive Bayes
ring      $3.8 \times 10^{-9}$      $2.8 \times 10^{-4}$   Naive Bayes
just      $1.8 \times 10^{-4}$      $1.6 \times 10^{-2}$   Naive Bayes
bear      $3.3 \times 10^{-5}$      $8.0 \times 10^{-3}$   C4.5
rock      $< 1.0 \times 10^{-10}$   $9.7 \times 10^{-9}$   Naive Bayes
close     $< 1.0 \times 10^{-10}$   $8.0 \times 10^{-2}$   Naive Bayes



The contribution from each network metric in discriminating word senses was estimated by
first finding the smallest subset of measurements generating the best classifiers (see
last column of Table 2). Although we used 16 measurements to characterize nodes, the
best accuracy rates were obtained with a maximum of 5 measurements. Strikingly, in some
cases only two measurements were already sufficient to provide a reasonable distinction,
as indicated in the scatter plot for the word "save" in Figure 2. Quantitatively, the
relevance of each metric (i.e. feature) for disambiguating the words was calculated in
two ways: using the Kullback-Leibler (KL) divergence and the method based on the
Mann-Whitney U (MWU) test [39]. While in the latter features are evaluated individually,
in the former the interaction between features is considered. Thus, it is possible to
identify cases where features are useful only when combined with others. Details
regarding the KL divergence and the MWU test are given in the SI and in Ref. [?].

Fig. 2: Scatter plot for the word "save". While one sense is characterized by high $l$
and low $\langle s_n \rangle$, the other sense usually takes low values of $l$.

The rankings shown in Table 4 indicate that the relevance of a metric varies from word
to word. In addition, a metric may be relevant when analyzed individually but not so if
combined with other attributes because some features included in the first method may
not be included in the second one. According to the KL divergence, the most frequent
relevant metrics (i.e. the ones which appear among the top 3) are $C_3$ and $l$. For the
MWU test, the clustering computed at high hierarchical level ($C_4$) is also a relevant
feature along with $s_3$ and $s_4$. Therefore, these results suggest that meanings are
often correlated with the strength (or frequency) of higher-order neighbors and with the
degree of interconnection of neighbors. Interestingly, $l$ was not so relevant when
combined with other features as it appeared only a few times in the MWU ranking.

Conclusion. — In this paper we have verified the suitability of the complex network
model for the word sense disambiguation task in large texts. Upon characterizing the
local structure of nodes representing ambiguous words, we obtained significant
discrimination, which means that different senses affect the structural organization of
complex networks. Strikingly, the discrimination was so effective for some words that
the topological characterization outperformed traditional shallow methods. In general,
the hierarchical characterizations of the clustering and connectivity measurements were
the most relevant features for WSD, even though the ranking of metrics varied from word
to word. The analysis here may shed light on the relationship between structure of
complex networks and semantics. From a practical standpoint, the methodology described
might be useful in hybrid approaches to improve state-of-the-art disambiguation systems.
Given an extensive set of texts, it is possible to obtain networks with the local
characterization of nodes representing words whose meaning is known beforehand. Then, an
ambiguous word of a book could be disambiguated by assigning meanings according to the
semantic (traditional approach) and topological features (CN approach) provided by the
training set. In future works, we plan to use wider window sizes to connect words and
additional complex network measurements [14], such as weighted versions of the shortest
path, clustering coefficient, and betweenness, along with CN-based classification
algorithms [40], to improve the performance of disambiguation systems in long texts.
Also, we shall study the influence on the results when the proposed methodology is
applied to other languages.

The authors acknowledge the financial support from
[Table 4: rankings (1st, 2nd and 3rd) of the most relevant measurements for each
ambiguous word according to the KL divergence and the Mann-Whitney U test; the table
body could not be recovered.]